“Replication is the ultimate standard by which scientific claims are judged.” - Peng (2011)
Replication is one of the fundamental tenets of science
Findings from studies that cannot be independently replicated should be treated with caution!
Either they are not generalisable (cf. prediction) or worse, there was an error in the study!
The Reproducibility Crisis
Sadly, we have a problem…
‘Is there a reproducibility crisis?’ - A survey of >1500 scientists (Baker 2016; Penny 2016).
Reproducible Research
Makes use of modern software tools to share data, code, etc. so that others can reproduce the same result as the original study, making all analyses open and transparent.
This is central to scientific progress!!!
BONUS: working reproducibly facilitates automated workflows, which is useful for applications like iterative near-term ecological forecasting!
Replication vs Reproducibility
Reproducibility falls short of full replication because it focuses on reproducing the same result from the same data set, rather than analyzing independently collected data.
This difference may seem trivial, but you’d be surprised at how few studies are even reproducible, let alone replicable.
Replication and the Reproducibility Spectrum
Full replication is a huge challenge, and sometimes impossible, e.g.
rare phenomena, long term records, very expensive projects like space missions, etc
Where the “gold standard” of full replication cannot be achieved, we have to settle for a lower rung somewhere on The Reproducibility Spectrum (Peng 2011).
Working reproducibly requires careful planning and documentation of each step in your scientific workflow from planning your data collection to sharing your results.
A data management plan (DMP) is a living document and should be regularly revised during the life of a project!
Collect & Assure
I’d argue that it is foolish to collect data without doing quality assurance and quality control (QA/QC) as you go, irrespective of how you are collecting the data.
An example data collection app I built in AppSheet that allows you to log GPS coordinates, take photos, record various fields, etc.
There are many tools that allow you to do quality assurance and quality control as you collect the data (or progressively, shortly after data collection events). See Epicollect or the QField plug-in for QGIS. Even MS Excel or Google Sheets with controlled fields can do the job.
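The kind of controlled-field check these tools apply can be sketched in a few lines of code. The example below is a minimal illustration in Python, not a description of how any of the tools above work internally; the field names (site, lat, lon, cover_pct) and the allowed ranges are hypothetical.

```python
# Minimal QA/QC sketch: flag problems in a record at the moment of entry.
# Field names and allowed ranges are hypothetical, for illustration only.

REQUIRED_FIELDS = ("site", "lat", "lon", "cover_pct")


def check_record(record):
    """Return a list of human-readable problems found in one record."""
    problems = []

    # Every required field must be present and non-empty
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")

    # Range checks only run on fields that are actually numeric
    lat = record.get("lat")
    if isinstance(lat, (int, float)) and not -90 <= lat <= 90:
        problems.append("lat out of range")

    lon = record.get("lon")
    if isinstance(lon, (int, float)) and not -180 <= lon <= 180:
        problems.append("lon out of range")

    cover = record.get("cover_pct")
    if isinstance(cover, (int, float)) and not 0 <= cover <= 100:
        problems.append("cover_pct out of range")

    return problems


good = {"site": "A1", "lat": -33.96, "lon": 18.46, "cover_pct": 40}
bad = {"site": "", "lat": -133.96, "lon": 18.46, "cover_pct": 140}

print(check_record(good))
print(check_record(bad))
```

Running a check like this on every record as it is captured means mistakes are caught while you can still go back and re-measure.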
A key consideration here is data provenance - the history of the data, including where it came from, how it was collected, and any transformations or analyses that have been applied.
It is very easy to lose track of data provenance, especially before long-term storage, and especially when working with large datasets or multiple collaborators. This can lead to confusion and errors in the analysis.
Ideally, you never alter the raw data directly. Use a code script to create a “clean” or “processed” version of the data that you work with. This keeps the raw data intact and the script documents all changes.
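A minimal sketch of this pattern, in Python: the raw file is only ever read, and every change is made by the script and written to a separate file. The file names, column names and cleaning rules here are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def clean_rows(rows):
    """Apply documented cleaning steps to rows read from the raw file."""
    cleaned = []
    for row in rows:
        # Step 1: strip stray whitespace from every value
        row = {key: value.strip() for key, value in row.items()}
        # Step 2: drop rows with no height measurement
        if row["height_cm"] == "":
            continue
        # Step 3: standardise species names to lower case
        row["species"] = row["species"].lower()
        cleaned.append(row)
    return cleaned


def make_clean_file(raw_path, clean_path):
    """Read the raw CSV (never modified) and write a cleaned copy."""
    with open(raw_path, newline="") as f:
        rows = list(csv.DictReader(f))
    cleaned = clean_rows(rows)
    with open(clean_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["species", "height_cm"])
        writer.writeheader()
        writer.writerows(cleaned)
    return cleaned


# Demonstration in a temporary folder so nothing real is touched
workdir = Path(tempfile.mkdtemp())
raw_file = workdir / "data_raw.csv"
raw_file.write_text("species,height_cm\n Protea repens ,120\nErica mammosa,\n")
raw_before = raw_file.read_text()

result = make_clean_file(raw_file, workdir / "data_clean.csv")
print(result)
```

Because the cleaning steps live in the script, rerunning it on the untouched raw file always regenerates the same processed file, and the script itself is the record of what was changed.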
Discover
Data that are not shared are effectively lost to science, and thus cannot be discovered or built upon by others.
Preserving your data in a well-known database with good metadata and a clear license allows others to discover and reuse your data, thus facilitating scientific progress.
There are many license options like the Creative Commons suite, which allow you to specify how your data can be used by others, while still giving you credit for your work. For software or code, there are other specific licenses like MIT or GPL, but these are not usually used for data.
Creative Commons Licenses:
CC0 = it is Open - i.e. no restrictions (a public domain dedication)
CC BY = by attribution
CC BY-SA = by attribution + share alike
CC BY-ND = by attribution + no derivatives
CC BY-NC = by attribution + non-commercial
CC BY-NC-SA = by attribution + non-commercial + share alike
CC BY-NC-ND = by attribution + non-commercial + no derivatives
Integrate & Analyse
“The fun bit”, but again, there are many things to bear in mind and keep track of so that your analysis is repeatable. This is largely covered by the sections on “Coding and code management” and “Computing environment and software” below.
Project files and folders can get unwieldy fast and really bog you down!
The main considerations are:
defining a simple, common, intuitive folder structure
using informative file names
version control where possible
e.g. GitHub, Google Docs, etc
Folders
Most projects have similar requirements
Here’s how I usually manage my folders:
“code” contains code for analyses
“data” often has separate “raw” and “processed” (or “clean”) folders
Large files (e.g. GIS) may be stored elsewhere
“output” contains figures and tables
Names should be
machine readable
avoid spaces and funny punctuation
support searching and splitting, e.g. “data_raw.csv”, “data_clean.csv” can be searched by keywords and split into fields by “_”
human readable
contents self evident from the name
support sorting
numeric or character prefixes separate files by component or step
folder structure helps here too
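As a minimal Python sketch of why such names pay off: file names built from fields separated by “_” can be searched, split and sorted with one-liners. The file names and the step_component_stage layout below are hypothetical.

```python
from pathlib import Path

# Hypothetical file names: <step>_<component>_<stage>.<extension>
filenames = [
    "02_model_fit.R",
    "01_data_clean.csv",
    "03_figures_final.R",
    "01_data_raw.csv",
]


def parse_name(filename):
    """Split a machine-readable file name into its named fields."""
    step, component, stage = Path(filename).stem.split("_")
    return {"step": step, "component": component, "stage": stage}


# Searching: keep only files whose name contains the keyword "data"
data_files = [name for name in filenames if "data" in name]

# Sorting: zero-padded numeric prefixes put files in workflow order
ordered = sorted(filenames)

print(parse_name("01_data_raw.csv"))
print(data_files)
print(ordered)
```

Note that this only works because the names avoid spaces and use “_” for nothing except separating fields.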
3. Coding and code management
Why write code?
“Point-and-click” software like Excel, Statistica, SPSS etc may seem easier, but you’ll regret it in the long run… e.g. When you have to rerun or remember what you did?
Coding rules
Coding is communication. Messy code is bad communication. Bad communication hampers collaboration and makes it easier to make mistakes…
Code is essential for reproducibility and automation.
Many software packages now allow you to save what you did as a script or “macro”, but these are usually not open source and not easily shared or reused.
R, Python, etc are open source and allow you to do almost any analysis in one workflow - even calling other software.
Why write code?
Automation - reusing code is one click, and you’re unlikely to introduce errors
A script provides a record of your analysis
Uninterrupted workflows - scientific coding languages like Python or R allow you to run almost any kind of analysis in one scripted workflow
GIS, phylogenetics, multivariate or Bayesian statistics, etc
saves you manually exporting and importing data between software
Most coding languages are open source (e.g. R, Python, JavaScript, etc)
Free! No one has to pay to reuse any code you share
Transparent - You (and others) can check the background code and functions you’re using, not just the software company
A culture of sharing code (online forums, with publications, etc)
Some coding rules
It’seasytowritemessyindecipherablecode!!! - Write code for people, not computers!!!
Check out the Tidyverse style guide for R-specific guidance, but here are some basics:
use consistent, meaningful and distinct names for variables and functions
use consistent code and formatting style - indents, spaces, line-breaks, etc
modularize code into manageable steps/chunks
or separate scripts that can be called in order from a master script or Makefile
write functions rather than repeating the same code
use commenting to explain what you’re doing at each step or in each function
“notebooks” like RMarkdown, Quarto, Jupyter or Sweave allow embedded code, simplifying documentation, master/Makefiles, etc and can be used to write manuscripts, presentations or websites (e.g. all my teaching materials)
check for mistakes at every step!!! Do the outputs make sense?
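To illustrate two of these rules (meaningful names, and functions rather than repeated code), here is a minimal sketch in Python; the same idea applies in R. The function name, variable names and values are made up for illustration.

```python
def standardise(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation (n - 1 in the denominator)
    variance = sum((value - mean) ** 2 for value in values) / (n - 1)
    sd = variance ** 0.5
    return [(value - mean) / sd for value in values]


# Hypothetical measurements
rainfall_mm = [10.0, 20.0, 30.0]
temperature_c = [15.0, 18.0, 21.0]

# One well-named function replaces copy-pasted arithmetic for each variable,
# so a fix to the calculation only ever needs to be made in one place
rainfall_z = standardise(rainfall_mm)
temperature_z = standardise(temperature_c)

print(rainfall_z)
print(temperature_z)
```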
Some coding rules continued…
start with a “recipe”
outline the steps/modules before you start coding to keep you on track
e.g. a common recipe in R (using commented headers):
# Header indicating purpose, author, date, version etc
# Define settings and load required libraries
# Read in data
# Wrangle/reformat/clean/summarize data as required
# Run analyses (often multiple steps)
# Wrangle/reformat/summarize analysis outputs for visualization
# Visualize outputs as figures or tables
avoid proprietary formats! i.e. use open source scripting languages and file formats
use version control!!!
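The recipe idea carries over to any scripting language. As a rough sketch, here it is filled in as a small runnable Python script, with each step as its own function; the data, names and analysis are placeholders.

```python
# Purpose: sketch of the "recipe" structure (placeholder data and analysis)
# Author, date, version: fill in for a real project

# --- Define settings and load required libraries ---
import statistics

MIN_VALID_HEIGHT_CM = 0


# --- Read in data ---
def read_data():
    # Placeholder: a real script would read a file from the "data" folder
    return [
        {"site": "A", "height_cm": 120},
        {"site": "A", "height_cm": 100},
        {"site": "B", "height_cm": 80},
        {"site": "B", "height_cm": -999},  # a missing-value code
    ]


# --- Wrangle/clean data ---
def clean_data(records):
    return [r for r in records if r["height_cm"] > MIN_VALID_HEIGHT_CM]


# --- Run analyses ---
def mean_height_by_site(records):
    heights = {}
    for r in records:
        heights.setdefault(r["site"], []).append(r["height_cm"])
    return {site: statistics.mean(values) for site, values in heights.items()}


# --- Summarize and visualize outputs (here, a plain-text table) ---
def format_table(results):
    lines = ["site  mean_height_cm"]
    for site in sorted(results):
        lines.append(f"{site}     {results[site]:.1f}")
    return "\n".join(lines)


# --- Master section: call the steps in order ---
raw = read_data()
clean = clean_data(raw)
results = mean_height_by_site(clean)
print(format_table(results))
```

Writing the commented headers first, and only then filling in the code beneath each one, keeps the script modular and easy to follow.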
Version control
Version control tools can be challenging, but also hugely simplify your workflow!
The advantages of version control:
They generally help project management, especially collaborations
They allow easy code sharing with collaborators or the public at large - through repositories (“repos”) or gists (code snippets)
The system is online, but you can also work offline by cloning the repo to your local PC. You can “push to” or “pull from” the online repo to keep versions in sync
Changes are tracked and reversible through commits
Changes must be committed with a commit message - creating a recoverable version that can be compared or reverted
Version control magically frees you from duplicated files!
Version control continued…
Users can easily adapt or build on each other’s code by forking repos and working on their own branch.
This allows you to repeat/replicate analyses or even build websites (like this one!)
Collaborators can propose changes via pull requests
Repo owners can accept and integrate changes seamlessly by reviewing and merging the forked branch back into the main branch
Comments associated with commits or pull requests provide a written record of changes and track the user, date, time, etc - all of which are useful for tracking mistakes and assigning blame when things go wrong
You can assign, log and track issues and feature requests
Version control - example workflow
Interestingly, all that is tracked are the commits, whereby versions are named (the nodes in the image), so all that the online Git repo records is the figure below. The black line is the OWNER’s main branch and the blue line is the COLLABORATOR’s fork.
Sharing your code and data is not enough to maintain reproducibility…
Software and hardware change between users, with upgrades, versions or user community preferences!
You’ll all know Microsoft Excel, but have you heard of Quattro Pro or Lotus, which were the preferred spreadsheet software of yesteryear?
The lazy solution
You can document the hardware and versions of software used so that others can recreate that computing environment if needed.
In R, you can simply run the sessionInfo() function, giving details below
This just makes it someone else’s problem to recreate your computing environment (usually you!), which is not ideal…
R version 4.4.3 (2025-02-28)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Africa/Johannesburg
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_4.0.1
loaded via a namespace (and not attached):
[1] vctrs_0.6.5 cli_3.6.5 knitr_1.50 rlang_1.1.6
[5] xfun_0.52 generics_0.1.4 S7_0.2.1 jsonlite_2.0.0
[9] labeling_0.4.3 glue_1.8.0 htmltools_0.5.8.1 scales_1.4.0
[13] rmarkdown_2.29 grid_4.4.3 tibble_3.3.0 evaluate_1.0.4
[17] fastmap_1.2.0 yaml_2.3.10 lifecycle_1.0.4 compiler_4.4.3
[21] dplyr_1.1.4 RColorBrewer_1.1-3 pkgconfig_2.0.3 rstudioapi_0.17.1
[25] farver_2.1.2 digest_0.6.39 R6_2.6.1 tidyselect_1.2.1
[29] dichromat_2.0-0.1 pillar_1.11.1 magrittr_2.0.4 withr_3.0.2
[33] tools_4.4.3 gtable_0.3.6
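Other languages have equivalents. As a rough Python analogue of sessionInfo(), the sketch below collects the interpreter version, platform and installed package versions using only the standard library. The function name session_info and the layout of the returned dictionary are my own choices for illustration, not a standard.

```python
import platform
import sys
from importlib import metadata


def session_info():
    """Collect basic details of the current Python computing environment."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip any distribution with broken metadata
            packages[name] = dist.version
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "packages": dict(sorted(packages.items())),
    }


info = session_info()
print("Python", info["python_version"], "on", info["platform"])
for name, version in info["packages"].items():
    print(f"{name}=={version}")
```

As with sessionInfo(), this only documents the environment; someone still has to rebuild it by hand.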
A better solution
If your entire workflow is within R, you can use the renv package to manage your R environment.
renv is essentially a package manager.
It creates a snapshot of your R environment, including all packages and their versions, so that anyone can recreate the same environment by running renv::restore()
The disadvantage is that it doesn’t account for:
Different versions of R
Different operating systems
Software outside of R (e.g. JAGS, Stan, Python, GitHub etc)
The best solution?
Use containers, such as those provided by Docker or Singularity.
Containers provide “images” of self-contained, lightweight computing environments that you can package with your software/workflow to set up what are effectively lightweight virtual machines, with all the necessary software, settings, etc.
You set your container up to have everything you need to run your workflow (and nothing extra), so anyone can download (or clone) your container, code and data and run your analyses perfectly every time.
Containers are usually based on Linux, because it is free and open source, unlike most other operating systems.
The Rocker project provides a set of Docker images for R and RStudio, which are widely used in the R community.
5. Sharing data, code, publication etc
This is covered by data management, but suffice to say there’s no point working reproducibly if you’re not going to share all the components necessary to complete your workflow…
Another key component here is that ideally all your data, code, publication etc are shared Open Access
not stuck behind some paywall!
not in a proprietary format or requiring proprietary software
A 3-step, 10-point checklist to guide researchers toward greater reproducibility (Alston and Rick 2021).
Automation?
The key to iterating your workflow, especially for forecasting.
Many options!
Makefiles - simple text files that define how to run your code, e.g. in R, Python, etc
RMarkdown or Quarto - allow you to write code and text in the same document, which can be run to produce a report, website, etc
GitHub Actions - allows you to automate workflows in GitHub, e.g. running tests, building documentation, etc
R - R has many packages for automating workflows, e.g. targets
An example
The project aims to develop a near-real-time satellite change detection system for the Fynbos Biome using an ecological forecasting approach (www.emma.eco).
EMMA Workflow
The workflow is designed to be run on a weekly basis, with new data ingested and processed automatically.
There are several steps, each of which is run automatically:
Data ingest - new data is downloaded from various APIs
Data processing - to extract the relevant info and reformat for analysis
Data analysis - the data is analysed to detect changes in the environment
Data visualization and sharing - via a Quarto website run from a GitHub repository
EMMA Workflow
Outputs a Quarto website, automatically built from a GitHub repository.
Processing and analysis done in R. Intermediate and final outputs stored as GitHub releases or in GitHub Large File Storage.
R workflow managed by the targets package
GitHub Actions used to automate and run the workflow
Docker container sets up the computing environment
All code, data, metadata, etc are shared on GitHub
EMMA targets Workflow
Example targets workflow from https://wlandau.github.io/targets-tutorial/#8
EMMA targets workflow
targets is an R package that allows you to define a workflow as a series of steps, each of which can be run automatically.
The package identifies which steps are out of date and runs them and their dependencies, but ignores unaffected steps, saving computation.
In EMMA, the workflow is defined as a series of R scripts, which are run automatically on a weekly basis by GitHub Actions on a GitHub runner. targets keeps track of which steps have been run and controls which need to be rerun depending on new data inputs, etc.
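The core idea of skipping up-to-date steps can be illustrated with a toy pipeline in Python. This is not how targets is implemented, nor is it the EMMA code; it is a minimal sketch that assumes a step is out of date whenever its input differs from the input it saw on the last run, and the step names and functions are hypothetical.

```python
import hashlib
import json


def fingerprint(value):
    """Hash a JSON-serialisable value so changes can be detected cheaply."""
    text = json.dumps(value, sort_keys=True)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class Pipeline:
    """Run named steps in order, skipping steps whose input is unchanged."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, function) pairs, run in order
        self.cache = {}     # name -> (input fingerprint, stored output)

    def run(self, data):
        """Run the pipeline and return (final output, names of steps rerun)."""
        rerun = []
        for name, function in self.steps:
            key = fingerprint(data)
            cached = self.cache.get(name)
            if cached is not None and cached[0] == key:
                data = cached[1]  # up to date: reuse the stored output
            else:
                data = function(data)  # out of date: recompute and store
                self.cache[name] = (key, data)
                rerun.append(name)
        return data, rerun


# Hypothetical three-step workflow
def ingest(values):
    return [v for v in values if v is not None]


def process(values):
    return [v > 10 for v in values]  # flag values above a threshold


def analyse(flags):
    return sum(flags)  # count the flagged values


pipeline = Pipeline([("ingest", ingest), ("process", process), ("analyse", analyse)])

first = pipeline.run([5, 12, None, 20])
second = pipeline.run([5, 12, None, 20])   # same data
third = pipeline.run([5, 12, 20, None])    # new raw data, same cleaned values

print(first)
print(second)
print(third)
```

On the first run every step executes; on the second nothing does; on the third only the ingest step reruns, because its output (and so the input to every later step) is unchanged. For expensive steps this is where the saving in computation comes from.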
Unit testing
A key component of automation is unit testing
testing each component of your code to ensure it works as expected
This is a part of general coding and code management, but is especially important for forecasting, where you need to ensure that your code runs correctly on new data
R has many packages for unit testing, e.g. testthat and RUnit
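For illustration, here is a minimal unit test written with Python’s built-in unittest module, standing in for testthat or RUnit. The function under test, flag_change, and its threshold are hypothetical.

```python
import unittest


def flag_change(observed, predicted, threshold=0.2):
    """Flag observations that depart from their prediction by more than threshold."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must be the same length")
    return [abs(o - p) > threshold for o, p in zip(observed, predicted)]


class TestFlagChange(unittest.TestCase):
    def test_flags_large_departures_only(self):
        self.assertEqual(flag_change([0.5, 0.9], [0.5, 0.5]), [False, True])

    def test_empty_input_gives_empty_output(self):
        self.assertEqual(flag_change([], []), [])

    def test_mismatched_lengths_raise(self):
        with self.assertRaises(ValueError):
            flag_change([0.5], [0.5, 0.6])


# Run the tests without exiting the interpreter
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestFlagChange)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
print("All tests passed:", outcome.wasSuccessful())
```

Tests like these can be run automatically every time the workflow is triggered, so a change that breaks a component is caught before it affects a forecast.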
References
Alston, Jesse M, and Jessica A Rick. 2021. “A beginner’s guide to conducting reproducible research.” Bulletin of the Ecological Society of America 102 (2). https://doi.org/10.1002/bes2.1801.
Baker, Monya. 2016. “1,500 scientists lift the lid on reproducibility.” Nature 533 (7604): 452–54. https://doi.org/10.1038/533452a.
Michener, William K, James W Brunt, John J Helly, Thomas B Kirchner, and Susan G Stafford. 1997. “Nongeospatial data for the ecological sciences.” Ecological Applications: A Publication of the Ecological Society of America 7 (1): 330–42. https://doi.org/10.1890/1051-0761(1997)007[0330:nmftes]2.0.co;2.